Univariate Plot Section

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] "3" "4" "5" "6" "7" "8"
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The mean quality is 5.6 and the median is 6.0. About 75% of wines have a quality score of 6 or lower. About 75% of wines have fixed acidity of 9.2 or less. About 75% of wines have residual sugar of 2.6 or less, but the maximum value, 15.5, is very high (i.e. very sweet). All wines have similar density, from 0.99 to 1.00.

Histogram of Quality

Quality takes only integer values. Most wines score 5 or 6. The scores can be grouped into three grades.

Histogram of Taste & Taste.detail

  • 3 and 4 : Not Good(NG)
  • 5 and 6 : Good(GD)
  • 7 and 8 : Excellent(EX)

##   NG   GD   EX 
##   63 1319  217
##  Factor w/ 3 levels "NG","GD","EX": 2 2 2 2 2 2 2 3 3 2 ...

  • NG : 4%
  • GD : 82%
  • EX : 14%
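The grouping above can be reproduced with `cut()`, which the Reflection section mentions. The following is a minimal sketch, not the report's own code: the `quality` vector is rebuilt from the per-score counts given under Description One instead of being read from the data file, and the break points are an assumption chosen to match the three grades.

```r
# Rebuild the quality column from the per-score counts reported in this document
# (3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18); the real analysis reads a file.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Three ordered grades: (2,4] -> NG, (4,6] -> GD, (6,8] -> EX (assumed breaks)
taste <- cut(quality, breaks = c(2, 4, 6, 8),
             labels = c("NG", "GD", "EX"), ordered_result = TRUE)

# Six ordered levels, one per observed score
taste.detail <- factor(quality, ordered = TRUE)

grade_counts <- table(taste)
grade_pct <- round(100 * prop.table(grade_counts))
print(grade_counts)
print(grade_pct)
```

The counts and rounded percentages from this sketch match the table and the list above.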

Histogram of Fixed.Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The middle 50% of wines have a fixed.acidity between 7.1 and 9.2, and the full range runs from 4.6 to 15.9.

Histogram of Volatile.Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

I want to remove outliers that are greater than 1.00, so that we can look at this histogram more closely. Most wines have a volatile.acidity between 0.2 and 1.0.

Histogram of Citric.Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## 
## FALSE  TRUE 
##  1467   132

After removing the outliers at the upper end, the distribution is closer to uniform. In addition, about 8% of wines (132 of 1599) have no citric.acid.

Histogram of Residual.Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The bin counts for residual.sugar vary widely, so I want to apply a log scale to the y-axis.

I set the y range to (1, 150), because a count below 1 does not make sense on a log scale. The residual sugar of most wines varies from 1.0 to 7.0.
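The reason for starting the y-axis at 1 can be checked numerically. The sketch below uses base R and synthetic right-skewed values as a stand-in for residual.sugar (both are assumptions made for illustration, not the report's plotting code).

```r
set.seed(1)
# Synthetic right-skewed values standing in for residual.sugar (not the real data)
x <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.4)

# Bin the values without drawing anything
h <- hist(x, breaks = 30, plot = FALSE)
log_counts <- log10(h$counts)

# A count of 1 maps to 0 on a log10 axis, while an empty bin (count 0)
# maps to -Inf and cannot be drawn, so the axis starts at 1.
print(log10(c(0, 1, 10, 100)))

drawable <- h$counts >= 1
print(sum(!drawable))  # number of empty bins that a log axis would drop
```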

Histogram of Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The bin counts for chlorides vary widely, so I applied a log scale to the y-axis.

I set the y range to (1, 100) so that it makes sense on a log scale. Most wines have chlorides ranging from 0.05 to 0.2.

Histogram of Free.Sulfur.Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

I removed the long tail to look at the plot more closely. Most wines have free.sulfur.dioxide under 40.

Histogram of Total.Sulfur.Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

As in the previous plot, I removed the data in the long tail. Most wines have total.sulfur.dioxide from 5 to 160.
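One way to drop a long tail is to keep values up to a high quantile. The sketch below assumes the same 99th-percentile cut-off that appears in the `lm()` calls in the bivariate section. The helper `trim_upper` and the synthetic data are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: keep values in (0, q-th quantile], mirroring the
# subset() condition visible in the lm() calls of this report.
trim_upper <- function(x, q = 0.99) {
  x[x > 0 & x <= quantile(x, q)]
}

set.seed(2)
# Synthetic long-tailed values standing in for total.sulfur.dioxide
so2 <- rexp(1599, rate = 1 / 46)

so2_trimmed <- trim_upper(so2)
print(length(so2) - length(so2_trimmed))  # rows dropped from the tail
print(max(so2_trimmed) <= quantile(so2, 0.99))
```

On continuous data with 1599 rows this keeps 1583 rows, which agrees with the 1581 residual degrees of freedom (1583 rows minus two coefficients) reported for several fits in this report.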

Histogram of Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

## [1] 3.562029e-06

Most wines have almost the same density: its variance is only about 3.6e-06.
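A variance is hard to read on its own because it is in squared units. As a small worked example using only the numbers printed above, the standard deviation and its size relative to the mean can be computed as follows (the coefficient of variation is an added illustration, not part of the original analysis).

```r
var_density  <- 3.562029e-06  # variance printed above
mean_density <- 0.9967        # mean from the summary above

sd_density <- sqrt(var_density)          # back to the original units
cv_density <- sd_density / mean_density  # spread relative to the mean

print(signif(sd_density, 3))        # about 0.00189
print(signif(100 * cv_density, 2))  # about 0.19 (percent)
```

A spread of roughly 0.2% of the mean is consistent with the statement that the wines have nearly identical density.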

Histogram of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

As expected, all wines are acidic (i.e. under pH 7): their pH ranges from 2.74 to 4.01.

Histogram of Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

I removed the outliers in the tail to see the trend of the histogram more clearly. Most wines have sulphates in the range from 0.3 to 1.2.

What is the structure of your dataset?

  1. 1599 wines in the data set with 14 features (I added two more features, ‘taste’ and ‘taste.detail’)
    • fixed.acidity
    • volatile.acidity
    • citric.acid
    • residual.sugar
    • chlorides
    • free.sulfur.dioxide
    • total.sulfur.dioxide
    • density
    • pH
    • sulphates
    • alcohol
    • quality
    • taste
    • taste.detail
  2. Every feature is numeric except taste and taste.detail. ‘taste’ is an ordered factor with 3 levels, “NG”, “GD”, and “EX”, and taste.detail is an ordered factor with 6 levels.

What is/are the main feature(s) of interest in your dataset?

The main features in this data are alcohol, residual.sugar, and quality (taste). I want to find out which of alcohol and residual.sugar does more to determine a wine’s flavor.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Some people enjoy the flavor of citric.acid, or a particular density. Therefore, these two features may also be factors in good taste.

Did you create any new variables from existing variables in the dataset?

I created a variable, ‘taste’, from the quality variable. Because quality takes only integer values, I think converting it to a categorical variable is a good idea. In addition, I simplified the 6 steps (3 to 8) to 3 steps (NG, GD, EX).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The quality variable is not convenient in its raw form. It takes only integer values, and I see no need to treat it as a number. So I decided to convert it to an ordered factor with 3 levels.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Alcohol has the strongest correlation with quality (0.48). This is far from a perfect correlation, but it is the highest in the table. Interestingly, residual.sugar is not correlated with quality (0.01). That means a sweeter wine does not imply a better-flavored wine.
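To read the quality column of the matrix more easily, the features can be ranked by the absolute value of their correlation. This sketch copies the values from the last column of the matrix above. The ranking step is an added illustration.

```r
# Correlations with quality, copied from the last column of the matrix above
cor_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Strongest relationship first, regardless of sign
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

On this ranking alcohol comes first and residual.sugar last. Volatile.acidity comes second in absolute value, with a negative sign.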

Sugar vs Quality

Residual.sugar shows no tendency with the quality feature. It turns out that sugar and quality have very little relationship.

## 
## Call:
## lm(formula = quality ~ residual.sugar, data = subset(wdf, residual.sugar > 
##     0 & residual.sugar <= quantile(wdf$residual.sugar, 0.99)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6743 -0.6319  0.3560  0.3717  2.3778 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.60528    0.05277 106.220   <2e-16 ***
## residual.sugar  0.01211    0.01993   0.608    0.544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.809 on 1581 degrees of freedom
## Multiple R-squared:  0.0002335,  Adjusted R-squared:  -0.0003989 
## F-statistic: 0.3692 on 1 and 1581 DF,  p-value: 0.5435

The R^2 value (about 0.0002) shows that sugar and quality have almost no linear relationship.

Alcohol vs Quality

This plot shows a tendency: the more alcohol, the higher the quality value.

## 
## Call:
## lm(formula = quality ~ alcohol, data = subset(wdf, alcohol > 
##     0 & alcohol <= quantile(wdf$alcohol, 0.99)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8535 -0.4077 -0.1848  0.5180  2.5923 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.76698    0.18234   9.691   <2e-16 ***
## alcohol      0.37150    0.01746  21.275   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.708 on 1583 degrees of freedom
## Multiple R-squared:  0.2224, Adjusted R-squared:  0.2219 
## F-statistic: 452.6 on 1 and 1583 DF,  p-value: < 2.2e-16

The R^2 value is 0.22, which means alcohol explains about 22% of the variance in wine quality.
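For a linear model with a single predictor, R^2 equals the squared correlation between the two variables. The sketch below checks this identity on the first ten rows printed by `str()` at the top of this report. Ten rows are only an illustration and do not reproduce the fit above.

```r
# First ten rows as printed by str() at the top of this report
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

fit <- lm(quality ~ alcohol)
r2  <- summary(fit)$r.squared
r   <- cor(alcohol, quality)
print(all.equal(r2, r^2))  # the identity holds for any simple regression

# Same identity applied to the full-data correlation from the matrix above
r2_full <- 0.47616632^2
print(round(r2_full, 3))
```

The squared full-data correlation is about 0.227. The fit above reports 0.2224 because it was run on the subset below the 99th percentile of alcohol.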

Chemically, acids have low pH values. We can verify this fact in the matrix plot: fixed.acidity and citric.acid both have a negative correlation with pH.

Citric Acid vs Quality

## 
## Call:
## lm(formula = quality ~ citric.acid, data = subset(wdf, citric.acid > 
##     0 & citric.acid <= quantile(wdf$citric.acid, 0.99)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9979 -0.6018  0.1152  0.4642  2.5962 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.37549    0.03904 137.697  < 2e-16 ***
## citric.acid  0.94312    0.11449   8.238 3.88e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7807 on 1449 degrees of freedom
## Multiple R-squared:  0.04474,    Adjusted R-squared:  0.04408 
## F-statistic: 67.86 on 1 and 1449 DF,  p-value: 3.877e-16

Citric.acid is also positively correlated with quality (0.23), which is relatively high for this data set.

Sulphates vs Quality

## 
## Call:
## lm(formula = quality ~ sulphates, data = subset(wdf, sulphates > 
##     0 & sulphates <= quantile(wdf$sulphates, 0.99)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.02595 -0.51097 -0.02595  0.47064  2.39707 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.44423    0.09018   49.28   <2e-16 ***
## sulphates    1.83920    0.13573   13.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7653 on 1581 degrees of freedom
## Multiple R-squared:  0.1041, Adjusted R-squared:  0.1035 
## F-statistic: 183.6 on 1 and 1581 DF,  p-value: < 2.2e-16

In addition, sulphates also have a relatively high correlation with quality. However, the R^2 values of citric.acid and sulphates (about 0.04 and 0.10) do not explain much of the variation in quality.

Density vs Alcohol

## 
## Call:
## lm(formula = alcohol ~ density, data = subset(wdf, density > 
##     0 & density <= quantile(wdf$density, 0.99)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9436 -0.7025 -0.0951  0.5631  4.7621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   321.03      12.58   25.53   <2e-16 ***
## density      -311.64      12.62  -24.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9022 on 1581 degrees of freedom
## Multiple R-squared:  0.2784, Adjusted R-squared:  0.278 
## F-statistic:   610 on 1 and 1581 DF,  p-value: < 2.2e-16

I can see a linear trend: density decreases as the alcohol percentage increases. The negative slope (about -312), together with an R^2 of about 0.28, supports the statement that alcohol and density have a negative linear dependence. Alcohol is less dense than water, so a higher alcohol percentage means a lower density.
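The slope is easier to interpret on the scale at which density actually varies. As a worked example, this sketch plugs the rounded coefficients from the table above into the fitted line. The helper `predict_alcohol` is a hypothetical name used only here.

```r
b0 <- 321.03   # intercept, from the coefficient table above
b1 <- -311.64  # slope for density, from the coefficient table above

# Fitted line: alcohol = b0 + b1 * density
predict_alcohol <- function(density) b0 + b1 * density

at_median <- predict_alcohol(0.9968)  # median density from the summary
slope_per_thousandth <- b1 * 0.001    # change in alcohol per +0.001 density

print(round(at_median, 2))
print(round(slope_per_thousandth, 2))
```

Under this fit, a wine at the median density is predicted to have about 10.4% alcohol, and each increase of 0.001 in density lowers the prediction by about 0.31 percentage points.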

Sulfur Dioxide : Total vs Free

## 
## Call:
## lm(formula = total.sulfur.dioxide ~ free.sulfur.dioxide, data = subset(wdf, 
##     free.sulfur.dioxide > 0 & free.sulfur.dioxide <= quantile(wdf$free.sulfur.dioxide, 
##         0.99)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.448 -12.888  -6.930   7.918 194.268 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.58440    1.15587   10.02   <2e-16 ***
## free.sulfur.dioxide  2.21727    0.06344   34.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.33 on 1581 degrees of freedom
## Multiple R-squared:  0.4359, Adjusted R-squared:  0.4355 
## F-statistic:  1222 on 1 and 1581 DF,  p-value: < 2.2e-16

Total sulfur dioxide and free sulfur dioxide have a positive linear dependence. Interestingly, the spread grows as the values get larger.

The two quantities are closely related (free sulfur dioxide is part of the total), so this trend is expected.

pH vs Citric Acid

## 
## Call:
## lm(formula = pH ~ citric.acid, data = subset(wdf, citric.acid > 
##     0 & citric.acid <= quantile(wdf$citric.acid, 0.99)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49303 -0.07917 -0.00429  0.08016  0.54310 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.417407   0.006417  532.52   <2e-16 ***
## citric.acid -0.403379   0.018820  -21.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1283 on 1449 degrees of freedom
## Multiple R-squared:  0.2407, Adjusted R-squared:  0.2402 
## F-statistic: 459.4 on 1 and 1449 DF,  p-value: < 2.2e-16

pH measures how acidic a substance is. A substance with pH under 7 is acidic, so the trend is clear: higher citric.acid values go with lower pH values.

pH vs Fixed Acid

Again, as in the previous case (pH vs citric.acid), higher fixed acidity goes with lower pH (correlation -0.68).
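A fit of pH on fixed.acidity would follow the same pattern as the other models in this section. The sketch below uses only the first ten rows printed by `str()` at the top of this report, so its coefficients are illustrative and are not the report's results. It only shows the form of the call and the expected sign of the slope.

```r
# First ten rows as printed by str() at the top of this report
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Simple linear regression of pH on fixed acidity
fit   <- lm(pH ~ fixed.acidity)
slope <- unname(coef(fit)["fixed.acidity"])

print(slope < 0)  # a negative slope means more fixed acidity, lower pH
```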

pH vs Volatile Acid

## 
## Call:
## lm(formula = pH ~ volatile.acidity, data = subset(wdf, volatile.acidity > 
##     0 & volatile.acidity <= quantile(wdf$volatile.acidity, 0.99)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56962 -0.09856  0.00115  0.09330  0.65543 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.20335    0.01227 261.024   <2e-16 ***
## volatile.acidity  0.20435    0.02238   9.129   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1498 on 1582 degrees of freedom
## Multiple R-squared:  0.05005,    Adjusted R-squared:  0.04945 
## F-statistic: 83.34 on 1 and 1582 DF,  p-value: < 2.2e-16

This relation is interesting. The trend is very unusual, because stronger acidity should generally mean lower pH. However, the R^2 value is near zero (about 0.05), so I can say that this trend carries little meaningful information.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

For alcohol, higher values go with better quality wine. The result for residual.sugar, however, was surprising. At first, I thought sweetness could be a main factor in high quality wine, but it turned out that the two are essentially unrelated.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

pH is correlated with fixed.acidity and citric.acid, which makes sense chemically. Because these features are correlated with each other, they may carry redundant information.

What was the strongest relationship you found?

The strongest relationship with quality is alcohol. The statistical analysis suggests that it affects the quality value. The others, including sugar, citric acid, and sulphates, are too weak to say much about their relation with quality.

Multivariate Plots Section

I converted quality to a categorical feature, taste (3 levels):

  • NG : quality 3,4
  • GD : quality 5,6
  • EX : quality 7,8

I also converted quality directly to a factor variable, taste.detail.

These four plots show the following:

We can check the previous findings again with boxplots.

Here is an interesting finding. Alcohol has no big impact on wine quality when the score is between 3 and 5. From score 6 upward, a higher alcohol percentage goes with better quality wine.

In this plot with simplified grades, we have similar results and findings. A high alcohol percentage mainly distinguishes EX grade wine, not NG and GD grade.

Nothing stands out in the plot above. I want to compare only NG and EX, leaving out GD, because there are too many GD cases.

The points can be divided diagonally. This again suggests that alcohol and citric.acid have a positive relationship with wine quality.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wdf)
## m2: lm(formula = I(quality) ~ I(alcohol) + sulphates, data = wdf)
## m3: lm(formula = I(quality) ~ I(alcohol) + sulphates + citric.acid, 
##     data = wdf)
## 
## =============================================
##                    m1        m2        m3    
## ---------------------------------------------
## (Intercept)      1.875***  1.375***  1.434***
##                 (0.175)   (0.177)   (0.176)  
## I(alcohol)       0.361***  0.346***  0.338***
##                 (0.017)   (0.016)   (0.016)  
## sulphates                  0.994***  0.814***
##                           (0.102)   (0.107)  
## citric.acid                          0.513***
##                                     (0.093)  
## ---------------------------------------------
## R-squared           0.227     0.270     0.284
## adj. R-squared      0.226     0.269     0.282
## sigma               0.710     0.690     0.684
## F                 468.267   294.988   210.501
## p                   0.000     0.000     0.000
## Log-likelihood  -1721.057 -1675.142 -1659.955
## Deviance          805.870   760.894   746.576
## AIC              3448.114  3358.284  3329.910
## BIC              3464.245  3379.793  3356.795
## N                1599      1599      1599    
## =============================================

As we added the main features to the linear model, the R^2 value increased (0.227, 0.270, 0.284).
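R^2 never decreases when a predictor is added, so the AIC and BIC rows of the table are a useful cross-check because they penalize extra parameters. The sketch below recomputes both from the log-likelihoods in the table. It assumes the usual parameter count for a linear model: the coefficients (including the intercept) plus the residual variance.

```r
# Log-likelihoods and sample size, copied from the table above
loglik <- c(m1 = -1721.057, m2 = -1675.142, m3 = -1659.955)
n <- 1599

# Parameters per model: coefficients (incl. intercept) + residual variance
k <- c(m1 = 3, m2 = 4, m3 = 5)

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

print(round(aic, 3))
print(round(bic, 3))
print(names(which.min(aic)))  # model preferred by AIC
```

The recomputed values match the AIC and BIC rows of the table, and both criteria prefer m3 even after the penalty for its extra terms.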

pH vs acidity vs quality

The color map is not neat, and it is hard to see the distribution of quality. So I will replace it with a categorical variable (that is why I added taste.detail).

A simplified version, with only 3 grades.

This time, we have too many green dots. I want to focus on the lowest and the best quality wines.

I could not find any particular relation in this graph. NG and EX are both scattered evenly.

density vs alcohol vs quality

I can say that excellent grade wines have a high alcohol percentage. Above 12% alcohol, NG grade wines hardly show up. However, neither grade shows a particular density trend.

pH vs volatile.acid vs fixed acid

In this plot, we can say that high fixed acidity goes with lower pH, and volatile acidity has almost no effect on pH.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationship between red wine quality and alcohol and citric.acid can be shown again by using categorical variables. This is more intuitive than numerical values alone.

In my investigation, alcohol and citric acid were the main factors in red wine quality.

Were there any interesting or surprising interactions between features?

Visualizing only the two extreme grades (NG and EX) was very successful. In particular, if you have too many samples, it is a good idea to compare the two extreme groups.


Final Plots and Summary

Plot One

Description One

When someone tries a wine, he or she instinctively feels one of 3 things (“not good”, “good”, or “excellent”). Therefore, I added a new variable, ‘taste’, for intuitive analysis and visual simplicity. The original data set has these quality counts (“3”: 10, “4”: 53, “5”: 681, “6”: 638, “7”: 199, “8”: 18). In the new variable ‘taste’, there are 63 “NG”s, 1319 “GD”s, and 217 “EX”s.

By doing this, it is possible to make the analysis intuitive while preserving the trend of the quality variable.

Plot Two

Description Two

I reasoned that a sweet wine would tend to get an excellent grade, because people love sweet drinks. However, this turned out to be wrong. In this box plot, the median values for each grade are “NG” : 2.100, “GD” : 2.200, and “EX” : 2.300. The median increases from “NG” to “EX”, but the mean does not follow the same trend. The mean values for each grade are “NG” : 2.685, “GD” : 2.504, and “EX” : 2.709. In addition, the R^2 value between residual.sugar and quality was 0.000233. This means there is no linear relationship between residual.sugar and quality (grade). Therefore, I can conclude that sugar is not a crucial factor in determining a red wine’s quality. Every grade has almost the same amount of residual sugar.
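The gap between the medians and the means is what a few very sweet wines in a group would produce. The sketch below shows the effect on the first ten rows printed by `str()` at the top of this report. It is only an illustration: these rows contain no NG wines, and the grade break points are assumed.

```r
# First ten rows as printed by str() at the top of this report
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
quality        <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Same three grades as in the report (break points are an assumption)
taste <- cut(quality, breaks = c(2, 4, 6, 8),
             labels = c("NG", "GD", "EX"), ordered_result = TRUE)

# Per-grade median and mean; NG is NA because these rows have no NG wines
med <- tapply(residual.sugar, taste, median)
avg <- tapply(residual.sugar, taste, mean)
print(rbind(median = med, mean = avg))
```

In the GD group the single value of 6.1 pulls the mean well above the median, while the median stays at the typical value.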

Plot Three

Description Three

Alcohol is the main reason people drink red wine or other liquors, so I suspected it is a main feature of high quality wine. In addition, potassium sulphate is a unique ingredient in making red wine, so it may be one of the reasons that people around the world enjoy red wine.

As I expected, “EX”-graded red wine tends to have about 1.3 percentage points more alcohol on average than “NG” grade. (“NG” : 10.22%, “EX” : 11.52%)

Also, “EX” grade contains 0.1513g/dm^3 more sulphates on average than “NG” grade. (“NG” : 0.5922g/dm^3, “EX” : 0.7435g/dm^3)

Finally, their positive correlations with quality (Alcohol : 0.48, Sulphates : 0.25) support the statement.


Reflection

The red wine data was very tidy, so it was very convenient to handle. However, every feature is numeric.

First, I added new categorical variables, ‘taste’ and ‘taste.detail’, based on the numerical ‘quality’ variable. This gave me a simple and intuitive categorical variable that preserves the tendency of the ‘quality’ variable.

Next, I did the univariate analysis for all variables. Some of them contain outliers, so I had to remove them or rescale the axes (x or y). After that, I could understand each variable better.

Third, in the bivariate analysis, I needed a correlation analysis and a matrix pair plot of multiple variables as an overview before choosing pairs of variables. The correlation analysis was a big help. From it, I could identify the likely main factors that determine wine quality.

In addition, I also tried to build linear models to verify their relations. By examining the R^2 value, I could judge how important each feature is for wine quality.

In the multivariate analysis, I had trouble visualizing the color map neatly. Numeric variables did not fit the ggplot color map scheme well, so I converted them to categorical variables in several ways. (I used ‘cut()’ and ‘as.factor()’.)

Finally, while sugar had no effect on wine quality, it turned out that alcohol and sulphates are the main factors that determine a high grade of wine, based on linear modeling. However, their R^2 values were not large.

For future work, we can apply other, nonlinear statistical models. Then, we may obtain a better fitting model with a larger R^2 value. This may also lead to new ways of analyzing this red wine data.